Predictive Capabilities of LASSO Regression

Kayla Hayes, Kristy Tarano, HJ Kim, Andrew Melara-Suazo

2024-07-30

Introduction & Literature Review

  • Diagnosing genetic disorders has been a significant challenge for thousands of years. Genetic predispositions are now a prime focus for medical professionals to make faster diagnoses (Claussnitzer et al. 2020).

  • Substantial time and resources are required to analyze genomes, which has spurred advancements in technology (Y. Liu et al. 2023).

  • Diagnoses are still often hindered by inefficiencies in computerized decision-making systems (Tamburrano et al. 2020).

  • Machine learning for assistance in diagnosis of genetic disorders is now more commonplace, with linear models more accurately predicting disorder outcomes through targeting gene mutation and expression patterns (Raza et al. 2022) (S. Liu et al. 2019).

  • Logistic regression, optimized with LASSO (Least Absolute Shrinkage and Selection Operator) regression, simplifies models and improves the selection of independent variables to address issues, such as those below, thereby enhancing predictive precision (Rusyana, Notodiputro, and Sartono 2021) (Ranstam and Cook 2018):

    1. Overfitting
    2. Overestimation
    3. Multicollinearity
  • This study uses demographic and health indicators to predict disorder sub-classes by combining LASSO and multinomial logistic regression.

  • It focuses on key information to better understand the factors influencing diagnoses and to improve diagnostic efficiency.

Methods

Multinomial Logistic Regression:

\[ \text{logit}(P_i) = \ln\left(\frac{P_i}{P_{\text{reference}}}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]
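As a small illustration of the formula above, the sketch below converts log-odds relative to a reference class into class probabilities. The class names and log-odds values are made up for the example and are not estimates from this study.

```r
# Hypothetical log-odds of two classes relative to a reference class,
# i.e. the right-hand side of the formula evaluated for one patient
log_odds <- c(class_A = 0.4, class_B = -1.1)

# The reference class has log-odds 0 by construction
eta <- c(reference = 0, log_odds)

# Invert the logit: P_i = exp(eta_i) / sum_j exp(eta_j)
probs <- exp(eta) / sum(exp(eta))
print(round(probs, 3))

# Check: ln(P_i / P_reference) recovers the original log-odds
recovered <- log(probs[-1] / probs["reference"])
print(recovered)
```

The probabilities sum to one, and the class with the largest probability is the predicted class.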

LASSO Regression:

\[ \hat{\beta} = \arg\min_{\beta} \left( \frac{1}{2n} \sum_{i=1}^n (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^p |\beta_j| \right) \]

L1 Penalization Portion:

\[ \lambda \sum_{j=1}^p |\beta_j| \]

Log Loss:

\[ \text{Log Loss} = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \]
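To make the penalty and loss terms concrete, here is a minimal base R sketch that evaluates both on toy values. The helper names `l1_penalty` and `log_loss`, the `eps` guard, and all numbers are assumptions for illustration; the log loss is the binary form written above, whereas a multinomial model sums the same kind of term over classes.

```r
# L1 penalty: lambda times the sum of absolute coefficient values
l1_penalty <- function(beta, lambda) {
  lambda * sum(abs(beta))
}

# Binary log loss; eps keeps predictions away from 0 and 1 to avoid log(0)
log_loss <- function(y, y_hat, eps = 1e-15) {
  y_hat <- pmin(pmax(y_hat, eps), 1 - eps)
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

# Toy values (hypothetical, not from the study data)
beta  <- c(0.5, -2, 0)
y     <- c(1, 0, 1, 1)
y_hat <- c(0.9, 0.2, 0.6, 0.5)

print(l1_penalty(beta, lambda = 0.1))  # 0.1 * (0.5 + 2 + 0)
print(log_loss(y, y_hat))
```

A larger `lambda` makes nonzero coefficients more costly, which is what pushes some of them to exactly zero.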

K-Fold Cross Validation:

\[ \text{CV}_{(K)} = \frac{1}{K} \sum_{k=1}^K \text{Err}_k \]
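The averaging above can be sketched in base R as follows. The toy data, the linear model, and the choice of K = 5 are assumptions for illustration; `cv.glmnet` performs the same kind of fold-wise averaging internally for each candidate lambda (10 folds by default).

```r
set.seed(1)  # arbitrary seed so the toy example is reproducible

# Toy data: y depends linearly on x plus noise (hypothetical)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)

K <- 5
# Randomly assign each row to one of K folds of equal size
fold_id <- sample(rep(seq_len(K), length.out = n))

# Err_k: mean squared error on fold k of a model fit on the other folds
err <- sapply(seq_len(K), function(k) {
  fit  <- lm(y ~ x, data = toy[fold_id != k, ])
  pred <- predict(fit, newdata = toy[fold_id == k, ])
  mean((toy$y[fold_id == k] - pred)^2)
})

# Cross-validation estimate: average of the K fold errors
cv_error <- mean(err)
print(cv_error)
```

Each observation is used for validation exactly once, so the estimate does not depend on a single lucky or unlucky split.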

Data & Visualization

  • Missing Data and Imputation Consideration

  • K-Fold Cross Validation Reasoning

  • Model Results & Validation

Variables in Original Dataset

# Packages Used
 
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(dplyr)
library(DT)
library(glmnet)
library(fastDummies)
library(nnet)
library(plyr)
library(readxl)
#install.packages("naniar")
library(naniar) 
#install.packages("VIM")
library(data.table)
#install.packages("mltools")
library(mltools)
library(ggplot2)
library(reshape2)
library(tidyr)
library(purrr)
library(readr)

# Load Data

Variables <- read.csv("Variables - Sheet1.csv", header = TRUE)
Variables <- read_csv("https://raw.githubusercontent.com/noojoya/STA6257_Project_LASSO-Regression/main/train_genetics.csv")

Target Data from Original Dataset

Target <- read.csv("Target - Sheet1.csv", header = TRUE)

Missing Values

Imputation Consideration & Decision: rows with missing values were ultimately removed (na.omit) instead of applying an imputation method

Split: 80/20 train/test ratio

Lambda Value: 0.0002451

Deviance explained: 99.9%

Accuracy: 0.494 provided by confusion matrix

df <- read_excel("train_genetics.xlsx")
# dim(df)
# glimpse(df)
df <- df[, !names(df) %in% c("Patient Id", "Patient First Name", "Family Name", "Father's name", "Institute Name", "Location of Institute", "Test 1", "Test 2", "Test 3", "Test 4", "Test 5", "Symptom 1", "Symptom 2", "Symptom 3", "Symptom 4", "Symptom 5", "Parental consent", "Follow-up", "H/O radiation exposure (x-ray)", "H/O substance abuse", "Birth asphyxia")]
# datatable(df)



VIM::aggr(df,prop=FALSE,numbers=TRUE)

#df <- df %>% filter(!is.na(df$`Mother's age`))
#df <- df %>% filter(!is.na(df$`Father's age`))
#VIM::aggr(df,prop=FALSE,numbers=TRUE)
#naniar::miss_var_summary(df)

dt <- na.omit(df)
dim(dt)
[1] 7176   24
sum(is.na(dt))
[1] 0
is.data.table(dt) # to see if data.table
[1] FALSE
dt <- as.data.table(dt)

# Convert Categorical data into numerical data

# Function to convert categorical variables to numeric
convert_to_numeric <- function(x) {
  if (is.numeric(x)) return(x)
  factor_x <- as.factor(x)
  as.numeric(factor_x)
}

# List of columns to convert (excluding already numeric columns and ID)

columns_to_convert <- c("Status", "Respiratory Rate (breaths/min)", "Heart Rate (rates/min", "Autopsy shows birth defect (if applicable)", "Place of birth", "Assisted conception IVF/ART", "History of anomalies in previous pregnancies","Birth defects", "Blood test result", "Genetic Disorder", "Disorder Subclass", "Gender")

# Convert specified columns to numeric
dt[, (columns_to_convert) := lapply(.SD, convert_to_numeric), .SDcols = columns_to_convert]

# "Respiratory Rate", "Heart Rate", and "Gender" need no special handling:
# they are in columns_to_convert, so they are already numeric factor codes at this point


# Convert Yes/No columns to 1/0
yes_no_columns <- c("Genes in mother's side", "Inherited from father", "Maternal gene", "Paternal gene", "Folic acid details (peri-conceptional)", "H/O serious maternal illness")

dt[, (yes_no_columns) := lapply(.SD, function(x) as.numeric(x == "Yes")), .SDcols = yes_no_columns]


VIM::aggr(dt,prop=FALSE,numbers=TRUE)
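To show what the conversion steps above do to a column, here is a minimal sketch on made-up values. The vectors `blood_test` and `yes_no` are hypothetical; the helper is repeated only so the example runs on its own.

```r
# Same helper as defined above, repeated so the example is self-contained
convert_to_numeric <- function(x) {
  if (is.numeric(x)) return(x)
  as.numeric(as.factor(x))
}

# Hypothetical values resembling a categorical column
blood_test <- c("normal", "abnormal", "inconclusive", "normal")

# Factor levels are sorted alphabetically, so the codes are
# abnormal = 1, inconclusive = 2, normal = 3
codes <- convert_to_numeric(blood_test)
print(codes)

# Yes/No columns are coded by direct comparison; NA stays NA
yes_no <- c("Yes", "No", NA, "Yes")
yes_no_codes <- as.numeric(yes_no == "Yes")
print(yes_no_codes)
```

Integer codes impose an order on categories that may have none, which is one reason a model matrix often uses dummy columns instead.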

Statistical Analysis

# loading packages 
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(glmnet)
library(fastDummies)
library(nnet)

#Loading data 
genetic_data <- read_csv('train_genetics.csv')

#Removing NA from data
clean_gene_data <- na.omit(genetic_data)

#Creating dummy columns
clean_gene_data <- dummy_cols(clean_gene_data, select_columns = c("Respiratory Rate (breaths/min)","Gender" ,"Heart Rate (rates/min", "H/O radiation exposure (x-ray)", "Birth asphyxia", "H/O substance abuse", "Birth defects", "Blood test result", "Disorder Subclass"), remove_first_dummy = TRUE)

# for consistency
set.seed(234)

# Get row indices for the training set and setting split
train_indices <- sample(seq_len(nrow(clean_gene_data)), size = 0.8 * nrow(clean_gene_data))

# Split the data into 80:20
train_data <- clean_gene_data[train_indices, ] # should be about 80% of the original data
test_data <- clean_gene_data[-train_indices, ] # should be about 20% of the original data


#Defining the response variable
y <- as.factor(train_data$`Genetic Disorder`)


#Defining the matrix of predictor variables for the model
x <- data.matrix(train_data[,c ("Genes in mother's side",'Maternal gene','Paternal gene' ,'Inherited from father', 'Blood cell count (mcL)', 'Respiratory Rate (breaths/min)_Tachypnea', 'Heart Rate (rates/min_Tachycardia','Gender_Female', 'Gender_Male','Birth asphyxia_No record', 'Birth asphyxia_Not available','Birth asphyxia_Yes','Folic acid details (peri-conceptional)', 'H/O serious maternal illness', 'H/O radiation exposure (x-ray)_No', 'H/O radiation exposure (x-ray)_Not applicable', 'H/O radiation exposure (x-ray)_Yes', 'H/O substance abuse_No', 'H/O substance abuse_Not applicable', 'H/O substance abuse_Yes', 'Assisted conception IVF/ART', 'History of a0malies in previous pregnancies', 'Birth defects_Singular', 'Blood test result_inconclusive', 'Blood test result_normal', 'Blood test result_slightly abnormal','White Blood cell count (thousand per microliter)', 'Symptom 1', 'Symptom 2','Symptom 3','Symptom 4', 'Symptom 5', 'Test 1', 'Test 2','Test 3','Test 4', 'Test 5', 'Disorder Subclass_Cancer', 'Disorder Subclass_Cystic fibrosis', 'Disorder Subclass_Diabetes', 'Disorder Subclass_Hemochromatosis', "Disorder Subclass_Leber's hereditary optic neuropathy", 'Disorder Subclass_Leigh syndrome', 'Disorder Subclass_Mitochondrial myopathy', 'Disorder Subclass_Tay-Sachs')])


#Looking for optimal lambda value using k-fold cross validation
cross_val_model <- cv.glmnet(x, y, family = "multinomial", alpha= 1)

#Looking for the best lambda value 
min_lambda <- cross_val_model$lambda.min

#Value of Lambda
min_lambda
[1] 0.0002451145
#Plot of cross-validated multinomial deviance against log(lambda)
plot(cross_val_model)
#Creating LASSO regression model 
gene_model <- glmnet(x, y, family = "multinomial", alpha = 1, lambda = min_lambda)

#Coefficients of the model
coef(gene_model)
$`Mitochondrial genetic inheritance disorders`
46 x 1 sparse Matrix of class "dgCMatrix"
                                                             s0
                                                      -0.028792
Genes in mother's side                                 .       
Maternal gene                                          .       
Paternal gene                                          .       
Inherited from father                                  .       
Blood cell count (mcL)                                 .       
Respiratory Rate (breaths/min)_Tachypnea               .       
Heart Rate (rates/min_Tachycardia                      .       
Gender_Female                                          .       
Gender_Male                                            .       
Birth asphyxia_No record                               .       
Birth asphyxia_Not available                           .       
Birth asphyxia_Yes                                     .       
Folic acid details (peri-conceptional)                 .       
H/O serious maternal illness                           .       
H/O radiation exposure (x-ray)_No                      .       
H/O radiation exposure (x-ray)_Not applicable          .       
H/O radiation exposure (x-ray)_Yes                     .       
H/O substance abuse_No                                 .       
H/O substance abuse_Not applicable                     .       
H/O substance abuse_Yes                                .       
Assisted conception IVF/ART                            .       
History of a0malies in previous pregnancies            .       
Birth defects_Singular                                 .       
Blood test result_inconclusive                         .       
Blood test result_normal                               .       
Blood test result_slightly abnormal                    .       
White Blood cell count (thousand per microliter)       .       
Symptom 1                                              .       
Symptom 2                                              .       
Symptom 3                                              .       
Symptom 4                                              .       
Symptom 5                                              .       
Test 1                                                 .       
Test 2                                                 .       
Test 3                                                 .       
Test 4                                                 .       
Test 5                                                 .       
Disorder Subclass_Cancer                               .       
Disorder Subclass_Cystic fibrosis                      .       
Disorder Subclass_Diabetes                             .       
Disorder Subclass_Hemochromatosis                      .       
Disorder Subclass_Leber's hereditary optic neuropathy 10.148918
Disorder Subclass_Leigh syndrome                      10.132946
Disorder Subclass_Mitochondrial myopathy               9.690685
Disorder Subclass_Tay-Sachs                            .       

$`Multifactorial genetic inheritance disorders`
46 x 1 sparse Matrix of class "dgCMatrix"
                                                               s0
                                                       0.06377615
Genes in mother's side                                 0.04009867
Maternal gene                                          .         
Paternal gene                                          0.19224017
Inherited from father                                  0.11309765
Blood cell count (mcL)                                 .         
Respiratory Rate (breaths/min)_Tachypnea               .         
Heart Rate (rates/min_Tachycardia                      .         
Gender_Female                                          .         
Gender_Male                                            .         
Birth asphyxia_No record                               .         
Birth asphyxia_Not available                           .         
Birth asphyxia_Yes                                     .         
Folic acid details (peri-conceptional)                 .         
H/O serious maternal illness                           .         
H/O radiation exposure (x-ray)_No                      .         
H/O radiation exposure (x-ray)_Not applicable          .         
H/O radiation exposure (x-ray)_Yes                     .         
H/O substance abuse_No                                 .         
H/O substance abuse_Not applicable                     .         
H/O substance abuse_Yes                                .         
Assisted conception IVF/ART                            .         
History of a0malies in previous pregnancies            .         
Birth defects_Singular                                 .         
Blood test result_inconclusive                         .         
Blood test result_normal                               .         
Blood test result_slightly abnormal                    .         
White Blood cell count (thousand per microliter)       .         
Symptom 1                                              0.06942615
Symptom 2                                              0.47468900
Symptom 3                                              0.74021267
Symptom 4                                              0.96951046
Symptom 5                                              1.35989465
Test 1                                                 .         
Test 2                                                 .         
Test 3                                                 .         
Test 4                                                 .         
Test 5                                                 .         
Disorder Subclass_Cancer                               5.98053346
Disorder Subclass_Cystic fibrosis                      .         
Disorder Subclass_Diabetes                             4.81155092
Disorder Subclass_Hemochromatosis                      .         
Disorder Subclass_Leber's hereditary optic neuropathy -0.14134838
Disorder Subclass_Leigh syndrome                       .         
Disorder Subclass_Mitochondrial myopathy               .         
Disorder Subclass_Tay-Sachs                            .         

$`Single-gene inheritance diseases`
46 x 1 sparse Matrix of class "dgCMatrix"
                                                               s0
                                                      -0.03498415
Genes in mother's side                                 .         
Maternal gene                                          .         
Paternal gene                                          .         
Inherited from father                                  .         
Blood cell count (mcL)                                 .         
Respiratory Rate (breaths/min)_Tachypnea               .         
Heart Rate (rates/min_Tachycardia                      .         
Gender_Female                                          .         
Gender_Male                                            .         
Birth asphyxia_No record                               .         
Birth asphyxia_Not available                           .         
Birth asphyxia_Yes                                     .         
Folic acid details (peri-conceptional)                 .         
H/O serious maternal illness                           .         
H/O radiation exposure (x-ray)_No                      .         
H/O radiation exposure (x-ray)_Not applicable          .         
H/O radiation exposure (x-ray)_Yes                     .         
H/O substance abuse_No                                 .         
H/O substance abuse_Not applicable                     .         
H/O substance abuse_Yes                                .         
Assisted conception IVF/ART                            .         
History of a0malies in previous pregnancies            .         
Birth defects_Singular                                 .         
Blood test result_inconclusive                         .         
Blood test result_normal                               .         
Blood test result_slightly abnormal                    .         
White Blood cell count (thousand per microliter)       .         
Symptom 1                                              .         
Symptom 2                                              .         
Symptom 3                                              .         
Symptom 4                                              .         
Symptom 5                                              .         
Test 1                                                 .         
Test 2                                                 .         
Test 3                                                 .         
Test 4                                                 .         
Test 5                                                 .         
Disorder Subclass_Cancer                               .         
Disorder Subclass_Cystic fibrosis                     10.62654612
Disorder Subclass_Diabetes                             .         
Disorder Subclass_Hemochromatosis                      8.30586094
Disorder Subclass_Leber's hereditary optic neuropathy  .         
Disorder Subclass_Leigh syndrome                       .         
Disorder Subclass_Mitochondrial myopathy               .         
Disorder Subclass_Tay-Sachs                            9.02598863
print(gene_model)

Call:  glmnet(x = x, y = y, family = "multinomial", alpha = 1, lambda = min_lambda) 

  Df %Dev    Lambda
1 16 99.9 0.0002451
#Creating matrix of predictor variables from test data
x_test <- data.matrix(test_data[,c ("Genes in mother's side",'Maternal gene','Paternal gene' ,'Inherited from father', 'Blood cell count (mcL)', 'Respiratory Rate (breaths/min)_Tachypnea', 'Heart Rate (rates/min_Tachycardia','Gender_Female', 'Gender_Male','Birth asphyxia_No record', 'Birth asphyxia_Not available','Birth asphyxia_Yes','Folic acid details (peri-conceptional)', 'H/O serious maternal illness', 'H/O radiation exposure (x-ray)_No', 'H/O radiation exposure (x-ray)_Not applicable', 'H/O radiation exposure (x-ray)_Yes', 'H/O substance abuse_No', 'H/O substance abuse_Not applicable', 'H/O substance abuse_Yes', 'Assisted conception IVF/ART', 'History of a0malies in previous pregnancies', 'Birth defects_Singular', 'Blood test result_inconclusive', 'Blood test result_normal', 'Blood test result_slightly abnormal','White Blood cell count (thousand per microliter)', 'Symptom 1', 'Symptom 2','Symptom 3','Symptom 4', 'Symptom 5', 'Test 1', 'Test 2','Test 3','Test 4', 'Test 5', 'Disorder Subclass_Cancer', 'Disorder Subclass_Cystic fibrosis', 'Disorder Subclass_Diabetes', 'Disorder Subclass_Hemochromatosis', "Disorder Subclass_Leber's hereditary optic neuropathy", 'Disorder Subclass_Leigh syndrome', 'Disorder Subclass_Mitochondrial myopathy', 'Disorder Subclass_Tay-Sachs')])

#Defining response variable from test data 
y_test <- as.factor(test_data$`Genetic Disorder`)



#CONFUSION MATRIX BELOW 

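#Predicting class labels for the test data (type = "class" returns labels, not probabilities)
pred_prob <- predict(gene_model, newx = x_test, s = min_lambda, type = "class")

#Intended as a probability threshold; because pred_prob holds character labels, the comparison with 0.05 is TRUE for every row, so every prediction is coded 1
pred_class <- ifelse(pred_prob > 0.05, 1, 0)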

#Creating the confusion matrix
confusion_matrix <- table(Predicted = pred_class, Actual = y_test)

print(confusion_matrix)
         Actual
Predicted Mitochondrial genetic inheritance disorders
        1                                         663
         Actual
Predicted Multifactorial genetic inheritance disorders
        1                                          135
         Actual
Predicted Single-gene inheritance diseases
        1                              544
#Calculates the overall true positive/negatives over the sum of all predictions
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Print the accuracy decimal
print(paste("Accuracy:", round(accuracy, 4)))
[1] "Accuracy: 0.494"
#Decimal proportion of true positives over true and false positive predictions
precision <- diag(confusion_matrix) / rowSums(confusion_matrix)

#Decimal proportion of true positives over true and false negative predictions
recall <- diag(confusion_matrix) / colSums(confusion_matrix)

#F1 Score measures accuracy and considers precision and recall and also accounts for false + and -
f1_score <- 2 * (precision * recall) / (precision + recall)
 

f1_score
 Mitochondrial genetic inheritance disorders 
                                   0.6613466 
Multifactorial genetic inheritance disorders 
                                   0.8977657 
            Single-gene inheritance diseases 
                                   0.7030753 
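Since `predict(..., type = "class")` returns predicted labels, one way to build a multiclass confusion matrix is to tabulate those labels directly against the actual labels. The sketch below does this on made-up label vectors; the class names `A`, `B`, `C` and all values are hypothetical and are not results from this study.

```r
# Hypothetical actual and predicted labels (not results from this study)
classes   <- c("A", "B", "C")
actual    <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"), levels = classes)
predicted <- factor(c("A", "B", "B", "B", "C", "A", "C", "A"), levels = classes)

# Using the same levels for both factors keeps the table square,
# so diag() picks out the correctly classified counts
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- diag(cm) / rowSums(cm)   # per predicted class
recall    <- diag(cm) / colSums(cm)   # per actual class
f1        <- 2 * precision * recall / (precision + recall)

print(round(accuracy, 3))
print(round(f1, 3))
```

With a square table, precision and recall stay between 0 and 1 for every class, which is a quick sanity check on the metrics.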

Conclusion

  1. Performance

Accuracy: 0.494,   Kappa: 1 → comparable to a coin toss

Effective variables: genes in mother’s side, paternal gene, inherited from father    

  2. Following the Process

Pre-processing → EDA → Analysis → Interpretation

  1. Removing rows with missing values and checking the resulting dimensions → Checking data types by visualization

  2. Finding the lambda value → Model setup → LASSO regression → Confusion matrix

  3. Lessons Learned

Did we extract and use the proper variables?

The model parameters followed the right recipe → feature selection needs improvement

What if only variables of the same type were used to build a model? e.g., binary, categorical, and continuous variables each used separately

Using only the variables chosen through feature selection might have improved the model

References

Claussnitzer, Melina, Judy H Cho, Rory Collins, Nancy J Cox, Emmanouil T Dermitzakis, Matthew E Hurles, Sekar Kathiresan, et al. 2020. “A Brief History of Human Disease Genetics.” Nature 577 (7789): 179–89.
Liu, Shuai, Mengye Lu, Hanshuang Li, and Yongchun Zuo. 2019. “Prediction of Gene Expression Patterns with Generalized Linear Regression Model.” Frontiers in Genetics 10: 120.
Liu, Yanqiu, Liangwei Mao, Hui Huang, Wei Li, Jianfen Man, Wenqian Zhang, Lina Wang, et al. 2023. “Clinical Diagnosis of Genetic Disorders at Both Single-Nucleotide and Chromosomal Levels Based on BGISEQ-500 Platform.” Human Genome Variation 10 (1): 15.
Ranstam, Jonas, and Jonathan A Cook. 2018. “LASSO Regression.” British Journal of Surgery 105 (10): 1348.
Raza, Ali, Furqan Rustam, Hafeez Ur Rehman Siddiqui, Isabel de la Torre Diez, Begoña Garcia-Zapirain, Ernesto Lee, and Imran Ashraf. 2022. “Predicting Genetic Disorder and Types of Disorder Using Chain Classifier Approach.” Genes 14 (1): 71.
Rusyana, A, KA Notodiputro, and B Sartono. 2021. “The Lasso Binary Logistic Regression Method for Selecting Variables That Affect the Recovery of Covid-19 Patients in China.” In Journal of Physics: Conference Series, 1882:012035. 1. IOP Publishing.
Tamburrano, Andrea, Doriana Vallone, Cinzia Carrozza, Andrea Urbani, Maurizio Sanguinetti, Nicola Nicolotti, Andrea Cambieri, and Patrizia Laurenti. 2020. “Evaluation and Cost Estimation of Laboratory Test Overuse in 43 Commonly Ordered Parameters Through a Computerized Clinical Decision Support System (CCDSS) in a Large University Hospital.” PLoS One 15 (8): e0237159.